{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "nncp-splitter.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true,
      "authorship_tag": "ABX9TyP6w1zquku+OO4QX0cvwyFd"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UnGWX9rcuGF_"
      },
      "source": [
        "# NNCP Splitter"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "v5LlpujkIffL"
      },
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/byronknoll/tensorflow-compress/blob/master/nncp-splitter.ipynb)\n",
        "\n",
        "Made by Byron Knoll. GitHub repository: https://github.com/byronknoll/tensorflow-compress\n",
        "\n",
        "### Description\n",
        "\n",
        "This notebook can be used to split files that have been preprocessed by NNCP. This is for compression using [tensorflow-compress](https://colab.research.google.com/github/byronknoll/tensorflow-compress/blob/master/tensorflow-compress.ipynb). The primary use-case is to get around Colab's session time limit by processing large files in smaller parts.\n",
        "\n",
        "This file splitting does not use the naive method of dividing the file into consecutive parts. Instead, it takes into account the batch size used in tensorflow-compress so that the same sequence of symbols will be used for compressing the split parts as for the original file.\n",
        "\n",
        "### Instructions\n",
        "\n",
        "1.   In tensorflow-compress, using \"preprocess_only\" mode, choose \"nncp\" preprocessor and download the result.\n",
        "2.   Upload the preprocessed file (named \"preprocessed.dat\") to this notebook, and download the split parts.\n",
        "3.   In tensorflow-compress, compress each split part sequentially, enabling the checkpoint option. Choose \"nncp-done\" as the preprocessor.\n",
        "4.   In tensorflow-compress, decompress each split part sequentially, enabling the checkpoint option. Choose \"nncp-done\" as the preprocessor.\n",
        "5.   Upload the decompressed parts to this notebook to reproduce the original file. The files should be named: part.0, part.1, ..., part.N. Also upload the original NNCP dictionary file (named \"dictionary.words\").\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nmHg0RhkhYjL"
      },
      "source": [
        "## Parameters"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aaZwD50lhaw8",
        "cellView": "form"
      },
      "source": [
        "batch_size = 96 #@param {type:\"integer\"}\n",
        "#@markdown >_Set this to the same value that will be used in tensorflow-compress._\n",
        "mode = 'split' #@param [\"split\", \"join\"]\n",
        "num_parts = 4 #@param {type:\"integer\"}\n",
        "#@markdown >_This is the number of parts the file should be split to._\n",
        "http_path = '' #@param {type:\"string\"}\n",
        "#@markdown >_The file from this URL will be downloaded. It is recommended to use Google Drive URLs to get fast transfer speed. Use this format for Google Drive files: https://drive.google.com/uc?id= and paste the file ID at the end of the URL. You can find the file ID from the \"Get Link\" URL in Google Drive. You can enter multiple URLs here, space separated._\n",
        "local_upload = False #@param {type:\"boolean\"}\n",
        "#@markdown >_If enabled, you will be prompted in the \"Setup Files\" section to select files to upload from your local computer. You can upload multiple files. Note: the upload speed can be quite slow (use \"http_path\" for better transfer speeds)._\n",
        "download_option = \"no_download\" #@param [\"no_download\", \"local\", \"google_drive\"]\n",
        "#@markdown >_If this is set to \"local\", the output files will be downloaded to your computer. If set to \"google_drive\", they will be copied to your Google Drive account (which is significantly faster than downloading locally)._"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ahlTtuuyho-N"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Y_WBbW1tiwIN",
        "cellView": "form"
      },
      "source": [
        "#@title Imports\n",
        "\n",
        "from google.colab import files\n",
        "from google.colab import drive\n",
        "import math"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aBioxnZ0iYRa",
        "cellView": "form"
      },
      "source": [
        "#@title Mount Google Drive\n",
        "if download_option == \"google_drive\":\n",
        "  drive.mount('/content/gdrive')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-rkLqRkLibrU",
        "cellView": "form"
      },
      "source": [
        "#@title Setup Files\n",
        "\n",
        "!mkdir -p \"data\"\n",
        "\n",
        "if local_upload:\n",
        "  %cd data\n",
        "  files.upload()\n",
        "  %cd ..\n",
        "\n",
        "if http_path:\n",
        "  %cd data\n",
        "  paths = http_path.split()\n",
        "  for path in paths:\n",
        "    !gdown $path\n",
        "  %cd ..\n",
        "\n",
        "if mode == \"join\":\n",
        "  !gdown --id 1EzVPbRkBIIbgOzvEMeM0YpibDi2R4SHD\n",
        "  !tar -xf nncp-2019-11-16.tar.gz\n",
        "  %cd nncp-2019-11-16/\n",
        "  !make preprocess\n",
        "  %cd .."
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dWxSPBcEhvpO"
      },
      "source": [
        "## Run"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CCYTdoWIjSbS",
        "cellView": "form"
      },
      "source": [
        "#@title Split/Join\n",
        "\n",
        "if mode == \"split\":\n",
        "  input_path = \"data/preprocessed.dat\"\n",
        "  orig = open(input_path, 'rb').read()\n",
        "  int_list = []\n",
        "  for i in range(0, len(orig), 2):\n",
        "    int_list.append(orig[i] * 256 + orig[i+1])\n",
        "  file_len = len(int_list)\n",
        "  split = math.ceil(file_len / batch_size)\n",
        "  part_split = math.ceil(file_len / (num_parts * batch_size))\n",
        "  pos = 0\n",
        "  for i in range(num_parts):\n",
        "    output = []\n",
        "    for j in range(batch_size):\n",
        "      for k in range(part_split):\n",
        "        if pos + k >= split:\n",
        "          break\n",
        "        index = pos + (j*split) + k\n",
        "        if index >= file_len:\n",
        "          break\n",
        "        output.append(int_list[index])\n",
        "    pos += part_split\n",
        "    with open((\"data/part.\" + str(i)), \"wb\") as out:\n",
        "      for j in range(len(output)):\n",
        "        out.write(bytes(((output[j] // 256),)))\n",
        "        out.write(bytes(((output[j] % 256),)))\n",
        "\n",
        "if mode == \"join\":\n",
        "  file_len = 0\n",
        "  for i in range(num_parts):\n",
        "    part = open(\"data/part.\" + str(i), 'rb').read()\n",
        "    file_len += len(part) / 2\n",
        "  split = math.ceil(file_len / batch_size)\n",
        "  part_split = math.ceil(file_len / (num_parts * batch_size))\n",
        "  int_list = [0] * math.floor(file_len)\n",
        "  pos = 0\n",
        "  for i in range(num_parts):\n",
        "    part = open(\"data/part.\" + str(i), 'rb').read()\n",
        "    part_list = []\n",
        "    for j in range(0, len(part), 2):\n",
        "      part_list.append(part[j] * 256 + part[j+1])\n",
        "    index2 = 0\n",
        "    for j in range(batch_size):\n",
        "      for k in range(part_split):\n",
        "        if pos + k >= split:\n",
        "          break\n",
        "        index = pos + (j*split) + k\n",
        "        if index >= file_len:\n",
        "          break\n",
        "        int_list[index] = part_list[index2]\n",
        "        index2 += 1\n",
        "    pos += part_split\n",
        "  with open(\"data/output.dat\", \"wb\") as out:\n",
        "    for i in range(len(int_list)):\n",
        "      out.write(bytes(((int_list[i] // 256),)))\n",
        "      out.write(bytes(((int_list[i] % 256),)))\n",
        "  !./nncp-2019-11-16/preprocess d data/dictionary.words ./data/output.dat ./data/final.dat"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "I-kbmJAHE7Pp",
        "cellView": "form"
      },
      "source": [
        "#@title File Sizes\n",
        "!ls -l data"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "l2KkrU6KlB7N",
        "cellView": "form"
      },
      "source": [
        "#@title MD5\n",
        "!md5sum data/*"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6xFHXedoDGI8",
        "cellView": "form"
      },
      "source": [
        "#@title Download Result\n",
        "def download(path):\n",
        "  \"\"\"Downloads the file at the specified path.\"\"\"\n",
        "  if download_option == 'local':\n",
        "    files.download(path)\n",
        "  elif download_option == 'google_drive':\n",
        "    !cp -f $path /content/gdrive/My\\ Drive\n",
        "\n",
        "if mode == \"split\":\n",
        "  for i in range(num_parts):\n",
        "    download(\"data/part.\" + str(i))\n",
        "\n",
        "if mode == \"join\":\n",
        "  download(\"data/final.dat\")"
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}